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Background: Designing amino acid sequences that are stable in a given target structure amounts 
to maximizing a conditional probability. A straightforward approach to accomplish this is a nested 
Monte Carlo where the conformation space is explored over and over again for different fixed se- 
quences, which requires excessive computational demand. Several approximate attempts to remedy 
this situation, based on energy minimization for fixed structure or high-T expansions, have been 
proposed. These methods are fast but often not accurate since folding occurs at low T. 

Results: We develop a multisequence Monte Carlo procedure, where both sequence and conforma- 
tion space are simultaneously probed with efficient prescriptions for pruning sequence space. The 
method is explored on hydrophobic/polar models. We first discuss short lattice chains, in order 
to compare with exact data and with other methods. The method is then successfully applied to 
lattice chains with up to 50 monomers, and to off-lattice 20-mers. 

Conclusions: The multisequence Monte Carlo method offers a new approach to sequence design 
in coarse-grained models. It is much more efficient than previous Monte Carlo methods, and is, as 
it stands, applicable to a fairly wide range of two-letter models. 
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1 Introduction 



The protein design problem amounts to finding an amino acid sequence given a target structure, 
which is stable in the target structure, and is able to fold fast into this structure. In a typical model 
the second requirement implies that stability must set in at not too low a temperature. Hence, 
one is led to consider the problem of finding sequences that maximize the stability of the target 
structure at a given temperature. In terms of a model described by an energy function E(r,a), 
where r = {ri,r2, ..,r/v} denotes the structure coordinates and a = {<7i,<72, ..,(tn} the amino acid 
sequence, this can be expressed as maximizing the conditional probability 

P(r |a) = -i-ex P [-£(r 0) a)/T], (1) 
Z[a) 

where r$ denotes the target structure, T the temperature and the partition function Z(a) is given 

by 

Z(a) = J2^M-E(r,a)/T}, (2) 

r 

Maximizing P(ro\a) with respect to a represents quite some challenge, since for any move in <r, the 
partition function Z{a) needs to be evaluated; each evaluation of P(ro|er) effectively amounts to a 
folding calculation for fixed sequence a. 

Different ways of handling this sequence optimization problem have been proposed and partly ex- 
plored in the context of coarse-grained (or minimalist) protein models, where amino acid residues 
represent the entities. The proposed methods fall into three categories: 



• E(ro, (r)-minimization |^, ||. If one simply ignores Z(a) in Eq. ([j]), one is left with the 
problem of minimizing E(ra,a). This is too crude, since for many coarse-grained models it 
implies that all er values line up to a homopolymer solution. This can be remedied by adding 
a constraint to E(ro, c) restricting the overall composition. This method is very fast since no 
exploration of the conformation space is involved, but it does fail for a number of examples 
even for small system sizes. 

• High-T expansion [|[ |[ [|. A more systematic approach is to approximate Z(a) with low- 
order terms in a cumulant or high-T expansion. This method is also fast, and slightly more 
accurate than the E(ro, cr)-minimization method, but can also fail since folding takes place at 
lowT. 

• Nested MC (NMC) [fjj. In order to avoid introducing uncontrolled approximations, one is 
forced to turn to Monte Carlo (MC) methods. The most straightforward MC approach is to 
use a normal fixed-er MC in r for estimating the Z{&) contribution to Eq. (|l|), which, however, 
leads to a nested algorithm with a highly non-trivial inner part. Although correct results have 
been reported for toy-sized problems, this approach is inhibitorily CPU time-consuming for 
larger problem sizes. 



In this paper we develop and explore an alternative MC methodology, Multisequence (MS) design, 
where the basic strategy is to create an enlarged configuration space; the sequence o becomes a 
dynamical variable pi. Hence, r and a are put on a more equal footing, which, in particular, 
enables us to avoid a nested MC. Early stages of this project were reported in ||. 
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The multisequence approach is explored on both a two-dimensional (2D) lattice model, the HP 
model of Lau and Dill |l(|, and a simple three-dimensional (3D) off-lattice models 11 , with very 



good results. As with any design method, one needs access to suitable target structures, and also to 
verify the results by folding calculations. For short chains in the 2D HP model, TV < 18, both these 
tasks are easy since all configurations can be enumerated. For longer lattice chains and off-lattice 
models powerful MC algorithms like simulated tempering JlJ, [l3|, |§| are needed for the verification. 

Our calculations for the HP model can be divided into two groups corresponding to short (N =16 
and 18) and long (N — 32 and 50) chains. The results for short chains are compared to exact 
enumerations, and we find that our method reproduces the exact results extremely rapidly. We 
also compare our results to those obtained by E(ro, ^-minimization and a high-T approach. It 
should be mentioned that for the former we scan through all possible fixed overall compositions, 
thereby giving this method a fair chance. Also, we make a detailed exact calculation illuminating 
the limitations of the high-T expansion approach. 

For larger N a "bootstrap" method is developed that overcomes the problem of keeping all possible 
sequences in the computer memory. The efficiency of this trick is illustrated for a N — 32 target 
structure, which is chosen "by hand". Finally, a N = 50 target structure is generated by using a 
design algorithm that aims at throwing away those sequences that are unsuitable for any structure. 
This N = 50 target structure is subsequently subject to our multisequence design approach, which 
readily finds a sequence with the target structure as its unique ground state. As a by-product, 
having access to good N — 50 sequences, we investigate the behavior at the folding transition 
which, to our knowledge, has not been studied before for comparable chain lengths. 

Earlier studies of the 3D off-lattice model jy], and a similar 2D model JlJ], have shown that the 
stability, as measured by the average size (S 2 ) of thermal structural fluctuations, is strongly sequence 
dependent. Here we perform design experiments using native structures of both stable (low (S 2 )) 
and unstable (high (S 2 )) sequences as target structures. The quality of the designed sequences is 
carefully examined by monitoring the thermal average of the mean-square distance to the target 
structure, (5 2 ). We find that the method consistently improves on (6q) and that it performs better 
than the E(ro, ^-minimization approach. 

This paper is organized as follows. Section |^ contains our multisequence approach together with 
implementation issues. In Sec. |^ the method is applied to the 2D HP model for sizes varying from 
N = 16 to 50. In this section, also comparisons between the different approaches are performed. 
The efficiency of the multisequence method is discussed in Sec. [|. In Sec. || 3D off-lattice model 
structures are designed, and Sec. [7] contains a brief summary. 



2 The Method 



2.1 Optimizing Conditional Probabilities 



Maximizing the conditional probability P(ro\a) of Eq. (Q) with respect to a for a given target 
structure ro is a challenge since it requires exploration of both conformation and sequence degrees 
of freedom. At high T this task can be approached by using a cumulant expansion of Z(cr), which 
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makes the problem much easier. Unfortunately, this is not the temperature regime of primary 
interest. In this paper we present an efficient MC-based procedure for sequence optimization at 
biologically relevant temperatures. 

The problem of maximizing P(ro|er) can be reformulated in terms of P(a\ro) by introducing a 
marginal distribution of c, P(cr), and the corresponding joint distribution P(r,a) = P(r\a)P(a). 
Assigning equal a priori probability to all the <r, i.e. P{&) — constant, one obtains 

P(r \a)= P ^ ro K p(a\r o)) (3) 

so maximizing P(ro\cr) is then equivalent to maximizing P(a\ro). 

In this paper we focus on the problem of designing a single structure ro- This is a special case of 
the more general problem of maximizing the probability 



E ( 4 ) 



reD 



for a group of desired structures, D. Note that for a general set D with more than one structure, 
this is not equivalent to maximizing X^eD P( a \ r )> smce 



P(a\r)P(r) 

Note that Eq. (||) differs from that of 0, where equivalence is assumed. 



reD reD ^ ' reD 



2.2 The Multisequence Method 

A MC-based method for optimization of P{ro\a) at general T has been proposed by Seno et al. 0. 
Their approach is based on simulated annealing in a with a chain-growth MC in r for each cr. This 
gives a nested MC which is prohibitively time-consuming except for very small systems. 

The multisequence method offers a fundamentally different approach. In this method one replaces 
the simulations of P(r\a) for a number of different fixed a by a single simulation of the joint 
probability distribution 

P(r,a) = ~exp[-g(<T)-E(r )t r)/T\, (6) 
Z = J2 e M-9^)]Z(a). (7) 

The parameters g(c) determine the marginal distribution 

P(a) = ±exp[-g(*)]Z(a) (8) 

and must therefore be chosen carefully. At first sight, it may seem that one would need to estimate 
Z(a) in order to obtain reasonable g(a). However, a convenient choice is 

g(cx) = -E(r ,a)/T, (9) 
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for which one has 



In other words, maximizing P(ro\a) is in this case equivalent to minimizing P(<j). This implies 
that bad sequences are visited more frequently than good ones in the simulation. This property 
may seem unattractive at a first glance. However, it can be used to eliminate bad sequences. The 
situation is illustrated in Fig. |l|. 

The idea of using the multisequence method for sequence design is natural since the task is to 
compare different sequences. Let us therefore stress that the method is not only convenient, but 
also efficient. The basic reason for this is that the system often can move more efficiently through 
conformation space if the sequence degrees of freedom are allowed to fluctuate. As a result, simulat- 
ing many sequences with the multisequence method can be faster than simulating a single sequence 
with standard methods, as will be shown in Sec. |J. Another appealing feature of the multisequence 
scheme is that the optimization of the desired quantity P{tq\u), which refers to a single structure, 
can be replaced by an optimization of the marginal probability P(cr). 

The basic idea of the multisequence method is the same as in the method of simulated tempering [ fl2| , 
Hp , H . The only difference is that in the latter it is the temperature rather than the sequence which 
is dynamical. It has been shown that simulated tempering is a very efficient method for fixed- 
sequence simulations in the HP model JT5| . In particular, it was applied to a N — 64 sequence with 
known ground state, for which other methods had failed to reach the ground state level. Simulated 
tempering was, by contrast, able to find the ground state. Below we use simulated tempering to 
check our sequence design results for long chains. 




Figure 1: The distribution P(r, a). The choice of g(a) in Eq. (||) makes P(ro, a) flat in a. A sequence 
not designing r$ will have maxima in P(ri\a) for ^ ro due to states with E(ri,a) < E(ro,cr). A 
sequence designing ro will have a unique maximum at r = ro in P(r\a), which for low T contains 
most of the probability. 
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2.3 Reducing the Sequence Set 



The simple scheme outlined above is normally of little use on its own. With a large number 
of sequences, it becomes impracticable, especially since bad sequences tend to dominate in the 
simulation. It is therefore essential to incorporate a procedure for removal of bad sequences. This 
elimination can be done in different ways. We will discuss two possibilities which will be referred to 
as P{<j)- and E(r, <j)-based elimination, respectively. Whereas both options are available for lattice 
models, P(<r)-based elimination is more appropriate for off-lattice models. 



2.3.1 P(cr)-Based Elimination 

P(cr)-based elimination relies on the fact that bad sequences have high P(cr) [see Eq. (|Tc|)]. The 
full design procedure consists in this case of a number of ordinary multisequence runs. After each 
of these runs P(cr) is estimated for all the N r remaining sequences, and those having 



are removed. Typical values of the parameter A are 1-2. 



2.3.2 E(r, cr)-based Elimination 

The procedure referred to as E(r, er)-based removes sequences that do not have the target structure 
ro as their unique ground state. For each conformation r ro visited in the simulation, it is checked, 
for each remaining sequence a, whether 



Those sequences for which Eq. ( fL2[ ) is true are removed. With this type of elimination, it may 
happen that one removes the sequence that actually maximizes P(ro|cr) at the design temperature 
— the best sequence at this temperature does not necessarily have ro as its unique ground state 
(for an example, see Fig. || below). This should not be viewed as a shortcoming of the method. If 
it happens, it rather means that the design temperature is too high. E(r, cr)-based elimination is 
free from statistical errors in the sense that a sequence that does have ro as its unique ground state 
cannot be removed. Hence, in a very long simulation the surviving sequences are, by construction, 
precisely those that have ro as their unique ground state. 



2.4 Restricted Search by Clamping 

For long chains it is not feasible to explore the entire sequence space. On the other hand, at least in 
a hydrophobic/hydrophilic model, there are typically several positions in the target structure where 
<7j is effectively frozen (see e.g. 0). As will be discussed below, it turns out that such positions 
can be easily detected by means of trial runs. 



P(cr) > A/N r 



(11) 



E{r,a) < E{r ,a). 



(12) 
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3 Lattice Model Results 



In this section we explore the multisequence approach on the HP model on the square lattice. In 
this context we also compare with and discuss other approaches; E(ro, ^-minimization and high-T 
expansions. 

The HP model contains two monomer types, H (hydrophobic) and P (hydrophilic/polar), and is 
defined by the energy function |l(| 

E(r,a) = -J2a i a j A(r i -r j ), (13) 

i<j 

where A(r,; — Tj) = 1 if monomers i and j are non-bonded nearest neighbors and otherwise. For 
hydrophobic and polar monomers, one has cr,; = 1 and 0, respectively. 

Our explorations naturally divide into two categories; N = 16 and 18, where finding suitable 
structures and verifying folding properties of the designed sequences is trivial, and N — 32 and 50, 
where this is not the case. 

For N < 18 the HP model can be solved exactly by enumeration. Hence such systems have been 
extensively used for gauging algorithm performances. In Table ^ properties for N — 16 and 18 
systems are listed |l^, [l9| . A structure is designable if there exists a sequence for which it represents 
a unique ground state. The fraction of designable structures drops sharply with N. Furthermore, 
it depends strongly upon local interactions ^9) . 





N = 16 


N= 18 


# of sequences (2 iv ) 

# of sequences with unique ground state 

# of structures 

# of designable structures 


65 536 
1 539 
802 075 
456 


262 144 
6 349 
5 808 335 
1 475 



Table 1: Sequence and structure statistics for the HP model for N ~ 16 and 18. 



For a given target structure ro , it is convenient to classify the sequences as "good" , "medium" or 
"bad" . Good sequences have ro as their unique ground state, whereas medium sequences have g > 1 
degenerate ground states, one of them being ro. Finally, bad sequences do not have ro as minimum 
energy structure. 

In our MC calculations, the elementary moves in r space are standard. Three types are used: one- 
bead, two-bead and pivot (^6). Throughout the paper, a MC sweep refers to a combination of N — 1 
one-bead steps, N — 2 two-bead steps and one pivot step. The new feature is that the r moves are 
combined with stochastic moves in cr. Each sweep is followed by one a update. The a update is an 
ordinary Metropolis step p7l . 
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3.1 N=16/18 



We have performed design calculations for a large number of different N — 16 and 18 target 
structures. Our results show that the multiscqucncc design method is able to reproduce the exact 
data very rapidly. Some examples illustrating this were reported in || . 

Our calculations for N = 16 and 18 are carried out using E(r, <r)-based elimination. Those sequences 
that survive the elimination are compared by determining their relative weights P(cr), see Eq. (|l0|). 
The stochastic a moves are essential in the second part of these calculations, when estimating P(cr), 
but the first part, the elimination, could in principle be done without using these moves. In H it 
was shown, however, that it is advantageous to include the stochastic a moves in the first part as 
well. The efficiency is higher and less T dependent when these moves are included. 

To make sure that the success reported in || was not accidental, we applied our method to all the 
1475 designable N — 18 structures. For each structure we performed five experiments, for different 
random number seeds, each started from all 2 N possible sequences. Since the elimination is E(r, u)- 
based, only the good sequences survive if the simulation is sufficiently long. The average number of 
MC sweeps needed to single out the good sequences was 123000 (30 CPU seconds on DEC Alpha 
200). A very few experiments required up to 10 7 MC sweeps, while all five experiments converged in 
less than 500000 MC sweeps for 90% of the structures. This shows that the elimination procedure 
is both fast and robust. 



3.2 Other Methods 



3.2.1 Minimizing E(r 0} a) 



Maximizing P(r \a) [see Eq. ([!])] is equivalent to minimizing the quantity 

AFo(er) = -TlnP(r |<7) = E(r ,a) - F(a) , (14) 

where F(a) is the free energy of sequence a at temperature T. In the energy minimization method 
one approximates Ai*b(c) by replacing F(a) with a constraint that conserves the net hydropho- 
bicity to a preset value Nh, 

J2°i = N H- (15) 

i 

The reason for imposing this constraint is more fundamental than just guiding the sequence opti- 
mization to an appropriate net hydrophobicity. In e.g. the HP model one has a pure "ferromagnetic" 
system in terms of a% for a fixed tq. Hence, minimizing E(tq, a) with respect to a would result in a 
homopolymer with all monomers being hydrophobic. With the constraint in Eq. (|l^) present, this 
is avoided. 

In the relevant Nh is picked for the structure to be designed. However, this does not corre- 
spond to a "real-world" situation, where Nh is not known beforehand. When comparing algorithm 
performances in [jj a default constraint, Nh = N/2, was therefore used. Below, we will in our 
comparisons scan through all Nh and minimize E(ro,a) separately for each Nh- 
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For N — 16 and 18 all 456 respective 1475 different designable structures (see Table |l|) are subject 
to design by minimizing E{to,<j) for all Nh- If the resulting minima are non-degenerate for fixed 
Nh, the sequences are kept as candidates for good sequences, otherwise they are discarded. A check 
of the results obtained this way against exact data shows that there is at least one good sequence 
among the candidates for 87%/78% of the structures for N — 16 and 18, respectively. In these cases 
we say that the method successfully can design the structure. Another measure of the success of 
the method is given by the probability that an arbitrary generated candidate is good. In total, we 
obtained 939/3546 candidates out of which 46%/36% (435/1245 sequences) are good. Therefore, in 
order to get the relatively high success rates mentioned above, it is essential to be able to distinguish 
good candidates from bad ones. The cost of doing this is for long chains much larger than that of 
the energy minimization itself. 

In Table || the performance of the E(ro, er)-minimization methods for N = 16 and 18 is compared 
with other approaches with respect to design ability and CPU consumption. As can be seen, the 
multisequence method with its 100% performance, is indeed very fast. Furthermore, the performance 
of the E(ro, ^-minimization variants deteriorates with size. 





EOtq, cr)-minimization 










N H = N/2 


All N H 


High-T 


NMC 


MS 


HP N = 16 


25% 


87% 


70% 


100% 


100% 


HP N = 18 


21% 


78% 


50% 


100% 


100% 


CPU sec/structure 


O(O.l) 


0(1) 


O(0.1) 


O(10 3 ) 


O(10) 



Table 2: Number of structures that get designed by the different approaches for N — 16 and 
18; E(ro, a)-minimization with fixed Nh = N/2 and with scanning through all Nh, respectively, 
the nested MC approach of [jjj (NMC), and the multisequence method (MS). Also shown is the 
computational demand for N — 18 (DEC Alpha 200). 



3.2.2 High-T Expansion Crossings 

A more systematic approach, based on cumulant approximations of F(o~), has been advocated by 
Deutsch and Kurosky H, and a method along these lines has also been proposed by Morrissey 
Shakhnovich jf|. However, these are high-T approximations, and can fail at relevant design tem- 
peratures, as has been pointed out by Seno et al. JtJ. 

The importance of the choice of the design temperature is easily studied for short HP chains, for 
which the T dependence of P(ro\a) can be computed exactly. At T — the relative population of 
7'o, P(ro\cr), is equal to 1, 1/g, and for good, medium, and bad sequences, respectively. For good 
sequences, the temperature at which P(ro\a) = 1/2 is often referred to as the folding temperature. 

We calculated the T dependence of P(r Q \a) for one N — 16 structure from jjj, which has 1 good 
and 1322 medium sequences, and one N — 18 structure from M with 7 good and 2372 medium 
sequences. In the N = 18 case, it turns out that there are 667 medium sequences that have higher 
P(ro\a) than at least one of the good sequences at some T. We denote these as crossing sequences. 
Figure || shows the results for the 7 good sequences and 4 of the crossing sequences. In particular, 
one sees that in order for P(ro\o~) optimization to actually lead to a good sequence, it is necessary 
to work at a design temperature not much higher than the highest folding temperature. At such low 
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0.1 0.2 0.3 0.4 0.5 

T 



Figure 2: P(ro\a) versus T for the seven good sequences (solid) and for four of the crossing sequences 
(dashed) for the N = 18 structure in ||. 

temperatures, it is clear that high-T approximations are inappropriate. For the N = 16 structure 
it was demonstrated in that the method of || fails. Indeed, it turns out that this structure has 
296 crossing sequences. 

With these crossing phenomena, it is not surprising that the high-T expansion frequently fails as can 
be seen from the summary in Table ^j, from which it is also clear that the performance deteriorates 
when increasing TV from 16 to 18. 

MC methods have the advantage that the design temperature can be taken low enough to avoid 
crossing problems, without introducing any systematic bias. Still, in practise, it is of course not 
possible to work at too low design temperatures, due to long decorrelation times at low T. It is 
therefore important to note that the multisequence E(r, c)-based elimination multisequence method 
can be carried out at any temperature without running the risk of eliminating any good sequences. 



3.3 N=32 

Having compared different methods for short chains, we now turn to longer chains focusing on 
multisequence design. For long chains it is not feasible to explore the entire sequence space. On the 
other hand, it is expected that, for a given target structure, there are several positions along the 
chain where most of the good sequences share the same 0$ value (see e.g. [jl6[ |l7|]); in other words, 
some positions are effectively frozen to H or P. A natural approach therefore is to restrict the search 
by identifying and subsequently clamping such Oi to H or P. For this purpose it is convenient to use 
a set of short trial runs, as was shown in Q, using the target structure in Fig. ||. For this structure 
ten <7, were clamped to H (filled circles in Fig. |3|) and ten to P (open circles). Sequence optimization 
is then performed with the remaining twelve (crosses) left open. 
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Figure 3: Target structure for N = 32. Symbols are explained in the text. 



This clamping method can of course be generalized to a corresponding multi-step procedure for very 
long chains. 

Taking the target structure in Fig. || as example, with the search restricted to 2 12 sequences as 
described above, we now discuss two other important issues. First, we compare the efficiency of 
E(r, <r)-based elimination to that of P(cr)-based elimination. In Fig. [I| we show the number of 
remaining sequences, N r , against MC time in three runs for each of the two methods (T = 1/3, 1 
CPU hour or less per run). E{r, cr)-based elimination is very fast in the beginning, and a level is 
quickly reached at which it is easy to perform a final multisequence simulation for the remaining 
sequences. The curves level off at relatively high N r , indicating that there are many good sequences 
for this structure (these runs were continued until all three contained the same 167 sequences). The 
three runs with P(<r)-based elimination, which were carried out using 50000 MC sweeps for each 
elimination step and A = 2 [see Eq. (|lT|)l, were continued until five sequences or fewer were left. 
The results were checked against those of the long multisequence simulations discussed in Sec. 0, 
and were found to be quite stable in spite of the fact that the runs were short. In particular, the 
best sequence (sequence A of Table || below) was among the survivors in all three cases. 

Next we take a look at the distribution P(a). The performance of the design procedure is crucially 
dependent on the shape of this distribution, especially when P(cr)-based elimination is used. One 
runs into problems if the distribution is dominated by a relatively small number of sequences with 
high P(cr). It is therefore interesting to see how the shape of P(a) evolves as the elimination process 
proceeds. Figure ||a shows the entropy of P(cr), 



in a run with P(a)-based elimination. With jV r remaining sequences, the maximal value of H is 
log 2 N r , corresponding to a uniform distribution P(cr). As can be seen from Fig. gp,, after a few 
elimination steps, H is close to this limit. The desired behavior of the marginal distribution of r, 
P(r), is in a sense the opposite, since the weight of the target structure should become large. The 
evolution of P(ro) in the same run is shown in Fig. ^Jd. 




(16) 
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400 
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1 2 3 4 5 

MC sweeps (xlCT 6 ) 

Figure 4: iV r against MC time for three runs with P(cr)-based elimination (full lines) and three with 
E(r, cr)-based elimination (dashed lines) for the target structure in Fig. || (see text). 




Step Step 

Figure 5: The evolution of (a) the entropy of -P(cr) and (b) the marginal probability P(ro) in a 
run with P(er)-based elimination (T = 1/3, A = 1, 10 7 MC sweeps for each elimination step) for 
the target structure in Fig. |[ The line in (a) shows log 2 N r , where A^ r is the number of remaining 
sequences. 

3.4 N=50 

A test of any design procedure consists of three steps: 

1. Finding a suitable target structure. 

2. Performing the actual design. 

3. Verifying that the final sequence is good. 

In this section we discuss the design of a N = 50 structure. For this system size the first step is 
highly non-trivial. Also, the verification part is quite time-consuming. For these reasons we focus 
on one example and go through each of the steps in some detail. 
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3.4.1 Finding a Suitable Target Structure 



We begin with the problem of finding a suitable target structure. For a randomly chosen structure 
it is unlikely that there is any sequence that can design it; the fraction of designable structures is 
e.g. about 0.00025 for N = 18 (see Table ||). Furthermore, this fraction decreases with system 
size. Rather than proceeding by trial and error, we therefore determined the target structure by 
employing a variant of our sequence design algorithm. In this version no target structure is specified 
and Eq. (||) is replaced by 

g(a) = —E min (a)/T, (17) 

where E m i n (a) ideally should be the minimum energy for the sequence a. In our calculations, 
since the minimum energy is unknown, we set E m - m (a) equal to the lowest energy encountered 
so far. Except for this change of the parameters g(a), we proceed exactly as before, using P(cr)- 
based elimination. However, a sequence is never eliminated if its g(<r) was changed during the last 
multisequence run, that is if a new lowest energy was found. 

With this algorithm, one may hope to identify and eliminate those sequences that are bad not only 
with respect to one particular structure, but with respect to all possible structures. Clearly, this is 
a much more ambitious goal, and it should be stressed a careful evaluation of the usefulness of this 
approach is beyond the scope of the present paper. 

This calculation was started from a set of about 2200 sequences. These were obtained by first 
randomly generating a mother sequence, with probability 0.65 for H, and then randomly changing 
this at one to three positions. Thus, there is a high degree of similarity between the sequences, 
which ensures a reasonable acceptance rate for the sequence update. After 37 elimination steps 
(T = 1/2.8, A = 1.5, 2 x 10 5 MC sweeps for each elimination step), three of the sequences were 
left. The best of these sequences and its minimum energy structure can be found in Fig. ||a. Note 
that this sequence does not minimize the energy for any fixed Nh — the energy can be reduced 
by interchanging the monomers i — 19 and 43 (i = 1 corresponds to the lowest of the two end 
points in Fig. ^a). In what follows we take this structure as our target structure, without using any 
information about the particular sequence shown. 



3.4.2 Sequence Design 



We began the sequence design for this structure by performing ten short runs, each started from 10 5 
random sequences. Based on these, 27 oi were clamped to H and 12 to P, as illustrated in Fig. 
It is interesting to compare these results to the original sequence in Fig. ||a. As expected, there is 
a close similarity, but there are also three positions along the chain at which <Ji is clamped to the 
opposite value compared to the original sequence [i — 2, 19 and 43). Thus, the original sequence 
does not belong to the restricted sequence set which we study next. 

Having restricted the search, we proceed in two steps. First, we apply E(r, er)-based elimination. 
As in the corresponding TV = 32 calculation, the number of remaining sequences rapidly reached a 
fairly stable and high level, indicating that there are many sequences with the target structure as 
unique ground state. The number of sequences surviving this first step was 832. The second step is 
a simulation with P(cr)-based elimination (T = 1/2.8, A = 1, 10 7 MC sweeps for each elimination 
step). This step was repeated three times using different random number seeds, each time starting 
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Figure 6: (a) Target structure for N = 50. This structure and the sequence shown were obtained 
using the design algorithm without fixed target [see Eq. (0)] . (b) Results of the clamping procedure 
for our TV = 50 target structure. Symbols are the same as in Fig. ||. 



from the same 832 sequences. The stability of the results was not perfect, but the best sequence 
found was the same in all three runs. This sequence has four H and seven P at the positions left 
open after clamping. The four positions that were assigned an H are i — 10, 11, 18 and 28. 



3.4.3 Verification 



In order to check the designed sequence, we performed an independent simulated-tempering calcu- 
lation. As mentioned in Sec. 2.2, in a previous study jL5j, this method successfully found the ground 
state of a N — 64 HP chain |15|. The length of our N = 50 simulation is 2 x 10 9 MC sweeps. In 
the simulation of the designed N = 50 sequence the target structure was visited many times; we 
estimate the number of "independent" visits to be about 30. By contrast, no other structure with 
the same or lower energy was encountered. We take this as strong evidence that the target structure 
indeed is a unique energy minimum for this sequence. 



Similar simulations were also performed for two other N = 50 sequences, SI and S2. The sequence 

51 is the one shown in Fig. |6^,, and S2 is the one obtained by assigning P to all open positions in 
Fig. ^b. At first sight SI may not seem to fit the target structure very well; as already noticed, 
this sequence does not minimize the energy of the target structure for fixed Njj- Nevertheless, our 
results suggest that both SI and S2, like the designed sequence, have the target structure as unique 
ground state. However, the dominance of this structure sets in at a lower temperature for SI and 

52 than for the designed sequence; rough estimates of the folding temperatures are 0.27 for the 
designed sequence and 0.23 for SI and S2. 



Unfortunately, it was not feasible to evaluate alternative methods for this system size, because the 
verification part is too time-consuming. Let us note, however, that our designed sequence uniquely 
minimizes the energy of the target structure for fixed Nh = 31. Sequence SI, on the other hand, 
appears to be good too, even though it does not minimize the target energy for any Nh- 
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Figure 7: The probability distributions of (a) the similarity parameter Q and (b) the energy E for 
the N = 50 sequences S2 at T = 0.227 « T f . 



3.5 The Folding Transition for N=50 



The simulated-tempering runs for the three N = 50 sequences provide thermodynamic data over 
a wide range of temperatures. In particular, they offer an accurate picture of the behavior at the 
folding transition. Shown in Fig. [7] are the probability distributions of the similarity parameter 
Q (number of contacts that a given conformation shares with the native state) and the energy E, 
close to the folding temperature Tf for sequence S2. The corresponding results for the other two 
sequences are qualitatively similar. 

From Fig. [j]a it can be seen that the distribution P(Q) has an essentially bimodal shape. The peak 
at Q — Q m ax = 34 corresponds to the native state and contains, by definition, around 50% of the 
distribution. The non-native peak, at Q/Q max ~ 0.4 — 0.6, is well separated from the native one. 
showing that the transition is cooperative in the sense that the system is either in the native state or 
in states that are structurally very different. It must be stressed, however, that it is not a two-state 
transition — the non-native part does not correspond to one ensemble of unfolded structures, but 
rather to a number of distinct folded low-energy states. The ruggedness of the non-native peak of 
P{Q) is an indication of this, and it becomes evident from the energy distribution of Fig. f^b, which 
shows no trace of bimodality. The fact that it is not a two-state transition is in line with general 
arguments for two-dimensional models [^0[ |l| . 



4 The Multisequence Method 



The multisequence method, which is a key ingredient in our design algorithm, was originally applied 
to a simple off-lattice model Q in the context of folding studies. Using parameters g(a) that had 
been adjusted so as to have an approximately uniform distribution in er, it was found to be much 
more efficient than a standard MC. In this paper we have instead chosen g(a) according to Eq. ([)]). 
This simple choice is not possible for a random set of sequences. The efficiency can, however, be 
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Sequence A HHPP HHPP PPHP HPPP PHPH PPPP HHPP HHHH 
Sequence B HHHP HHPP PPHP HHPP PHPH PPPP HHPH HHHH 
Sequence C HHPP HHPP PPHP HPPP PHPH PPPP HHPH PHHH 



Table 3: Three N = 32 HP sequences. 





1/T = 2.8 


1/T = 3.1 


1/T = 3.4 


Sequence A 


Standard MC 
Multisequence 


0.227 ±0.005 
0.234 ±0.006 


0.532 ±0.010 
0.520 ±0.010 


0.752 ±0.015 
0.732 ±0.008 


Sequence B 


Standard MC 
Multisequence 


0.0389 ± 0.0024 
0.0383 ±0.0013 


0.086 ±0.007 
0.095 ± 0.003 


0.166 ±0.021 
0.166 ±0.006 


Sequence C 


Standard MC 
Multisequence 


0.00251 ± 0.00012 
0.00250 ± 0.00009 


0.0066 ± 0.0003 
0.0066 ± 0.0003 


0.0133 ±0.0008 
0.0123 ±0.0005 



Table 4: Comparison of results for P(r^\a) obtained by two different methods, the multisequence 
algorithm and a standard fixed-sequence MC. Shown are results for the three sequences in Table || 
for three different temperatures. 



quite good after removal of bad sequences. To illustrate this, we take a set of 180 surviving N = 32 
sequences, from one of the three runs with E{r, a)-based elimination in Fig. ||. 

For these sequences we carried out multisequence simulations at three different temperatures, 
T=l/2.8, 1/3.1 and 1/3.4. The results of these simulations are compared to those of single-sequence 
simulations with identical r updates, for the three different sequences shown in Table ||. Sequence 
A is the best sequence found among all the 180. As can be seen from Table it has a folding 
temperature close to T = 1/3.1. Sequences B and C were deliberately chosen to represent different 
types of behavior, and have lower folding temperatures. It is interesting to note how different the 
sequences A and C behave (see Table in spite of the fact that they differ only by an interchange 
of two adjacent monomers. 

As the number of sweeps is the same, 10 9 , and since the cost of the additional sequence moves in 
the multisequence runs is negligible, we can directly compare the statistical errors from these runs. 
In Table ^ the averages and statistical errors for the quantity P(ro\a) are shown. The errors quoted 
are la errors, obtained by a jackknife procedure. 

From Table | it can be seen that the two methods give similar statistical errors at the highest T 
studied, which lies above the folding temperature for all three sequences. It should be stressed that 
equal errors implies that the multisequence method is faster by a factor of 180, since a single run 
covers all sequences with this method. Although there is a dependence upon sequence, there is 
furthermore a clear tendency that the errors from the multisequence runs get smaller than those 
from the single-sequence runs at lower T. The difference is largest for sequence B and the lowest 
temperature. In this case the errors differ by a factor of 3.5, which corresponds to an extra factor 
of 10 in computer time, in addition to the trivial factor of 180. 

This simple choice of g(a) [Eq. (Q)] has been used with success in all our calculations. Neverthe- 
less, let us finally note that multisequence design can also be applied using other g(<r) values. In 
particular, it is easy to modify Eq. (|Io| ) for general g{cr). 
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5 Off-Lattice Model Results 



Lattice models offer computational and pedagogical advantages, but the results obtained on the 
lattice must be interpreted with care; for example, it is has been shown that the number of designablc 
structures drastically depends on the lattice type in the HP model JllJ . In this section we therefore 
show that our design procedure can be applied essentially unchanged to a 3D minimalist off-lattice 
model |pH . While similar models have been studied before, see e.g. JlJ, ||[ |24|, it is the first 
time, as far as we know, that sequence design is performed in an off-lattice model based on sampling 
of the full conformation space. 

One problem encountered in going to off-lattice models is in the very formulation of the stability 
criterion. Clearly, it is the probability of being in the vicinity of the target structure ro that we are 
interested in, rather than the probability of being precisely in ro . While this point can be relevant 
for lattice models too, it is of more obvious importance in the off-lattice case. Throughout this 
paper, we stick to the probability P(ro\<r) corresponding to a single target structure r , using target 
structures that are obtained by energy minimization. In the general case, it might be necessary to 
consider instead the off-lattice analogue of the left hand side of Eq. (||) . 

Another problem is the elimination criterion for bad sequences. A straightforward implementation 
of E(ro, cr)-based elimination requires the introduction of a cutoff in structural similarity to ro, below 
which elimination should not take place. However, with such a cutoff, the method is too slow, since 
in order to have a reasonable elimination rate, it appears necessary to employ some sort of quenching 
procedure, which tends to be very time-consuming. By contrast, we found P(cr)-based elimination 
to be useful for off-lattice chains too, without any modifications or additional parameters. 

For the off-lattice model in contrast to the HP model, one does not have access to a set of small N 
exact enumerations results. Hence, for all sizes we need to go through the three steps needed for 
N > 18 HP chains: find suitable structures, perform design and verify that the designed sequence 
is stable in the desired structure. 



5.1 The Model 



Like the HP model, the 3D off-lattice model |ll| contains two kinds of residues, hydrophobic (er.j = 1) 
and hydrophilic (cr; = 0). Adjacent residues are linked by rigid bonds of unit length, bi, to form 
linear chains. The energy function is given by 

N-2 N-3 N-2 N / 1 1 \ 

E(b; a) = -k x b % ■ b i+l - k 2 X! ^ ' + 4 ^2 ^2 e ( a ^ a j) I ~n eT ( 18 ) 

i=i i=i i=i j=i+2 r v / 

where r^ denotes the distance between residues i and j. The first two sequence-independent terms 
define the local interactions, which turn out to be crucial for native structure formation Jil| |. The 
parameters are chosen as K\ = — 1 and «2 = 0.5 in order to obtain thermodynamically stable struc- 
tures, and to have local angle distributions and bond-bond correlations that qualitatively resemble 
those of functional proteins. The third term represents the sequence- dependent global interactions 
modeled by a Lennard- Jones potential. The depth of its minimum, e(<7j, cry), is chosen to favor the 
formation of a core of hydrophobic residues by setting e(0, 0) = 1, e(l, 1) = e(0, 1) = e(l, 0) = |. 
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To monitor structural stability we use the mean-square distance S 2 b between two arbitrary configu- 
rations a and b. An informative measure of stability is given in terms of the probability distribution 
P(S 2 ) of 5 2 b , and the corresponding mean, (8 2 ). The latter is small if the structural fluctuations 
are small, but tells nothing about the actual structure. In addition, we therefore measure the sim- 
ilarity to the desired structure tq. For this purpose we average S 2 b over configuration a, keeping 
configuration b fixed and equal to tq. This average will be denoted by (Sq). 

When investigating thermodynamic properties of this model one finds a strong dependence upon 
the local interactions. This impact of local interactions is not a peculiar property for off-lattice 
models. Indeed, similar findings have been reported for the HP lattice model |]l9| . 



5.2 Design Results 
Finding Suitable Structures 

We have determined the global energy minima, or native structures, for a number of N = 16 
sequences, and six of these structures are used as target structures in our design calculations. In 
addition, we consider six N — 20 target structures, which are native states of sequences studied 
in fll| . We restrict ourselves to these twelve examples because the verification of the design results, 
the computation of (S 2 ), is time-consuming. This selection of structures studied represent no bias 
with respect to the performance of the design algorithm. As can be seen from Tables | and |, some 
of the original sequences represent good folders (small (6 2 )) whereas others do not (large (S 2 )). An 
example of a N = 20 target structure can be found in [fLy . 



Designing the Sequences 

As discussed above, in our off-lattice calculations, we use P(<r)-based elimination, which, unlike 
E(r, <r)-based elimination, can be used as it stands. All our design calculations are carried out at 
the temperature T = 0.3, whereas the highest folding temperatures measured in |nj are close to 
0.2. This somewhat high design temperature was chosen in order to speed up the calculations. It 
is still low enough for design of stable sequences, as will become clear from the verification below. 
These verification calculations are performed at T = 0.15, using simulated tempering. 

Our P(er)-based design calculations starts out from the set of all 2 N possible sequences. Each 
iterative step amounts to a relatively short multisequence simulation consisting of 500000 MC cycles 
for the N r remaining sequences, followed by removal of those sequences for which the estimated P{a) 
fulfills Eq. ( pi] ) with A = 1.5. This is continued until a single sequence remains, which typically 
requires around 150 steps. The final sequence we take as the MS designed sequence. Each MC cycle 
consists of one attempt to update the conformation and one for the sequence. The conformation 
update is either a rotation of a single bond bi or a pivot move. The time consumption for the studied 
N = 16 and 20 chains ranges from three to six CPU hours. 

In our multisequence and simulated-tempering simulations each MC sweep in conformation space 
is followed by one attempt to update the sequence or temperature. The sequence and temperature 
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updates are both ordinary Metropolis steps 27 . The CPU cost of these updates is negligible 
compared to that of the conformation update. 



The designed sequences are shown in Tables |and | for N = 16 and 20, respectively. Also shown 
are the results of "naive" energy minimization ^ . Ideally, one should use this meth od b y scanning 
through all possible Ng [Eq. (fL5|)], which was done for N < 18 HP chains in Sec. ^2. However, 
given that the verification is quite tedious, we have chosen to use a single Njj only, corresponding to 
the original sequence. In other words the E(tq, ^-minimization method is given a slight advantage 
as compared to what would have been the case for a real-world application. 





method 


(T 


(<5^}t=0.15 


<<5o)t=0.15 


16-1 


target 


1111100101101111 


0.01 ± 0.0002 


0.01 ± 0.002 




MS 


1111100101111111 


0.01 ± 0.0002 


0.01 ± 0.002 




E(r ,a) 


1111100101011111 


0.02 ± 0.003 


0.01 ± 0.007 


16-2 


target 


1011001110011110 


0.07 ± 0.003 


0.04 ± 0.004 




MS 


1011001110011110 


0.03 ± 0.004 


0.02 ± 0.007 




E(r Q ,a) 


1111001010101110 


0.38 ± 0.03 


0.52 ± 0.02 


16-3 


target 


1010101001101111 


0.24 ± 0.05 


0.13 ± 0.02 




MS 


1111111101001111 


0.01 ± 0.001 


0.01 ± 0.006 




E(r ,a) 


1010101101001111 


0.08 ± 0.02 


0.04 ± 0.02 


16-4 


target 


1101101000010011 


0.38 ± 0.02 


0.25 ± 0.02 




MS 


1111101111010011 


0.12 ± 0.01 


0.36 ± 0.01 




E(r Q ,a) 


1010101000010111 


0.28 ± 0.01 


0.24 ± 0.02 


16-5 


target 


1001110011111111 


0.47 ± 0.02 


0.33 ± 0.02 




MS 


1001110010111111 


0.12 ± 0.002 


0.10 ± 0.01 




E(r ,a) 


1011110010111111 


0.11 ± 0.004 


0.11 ± 0.01 


16-6 


target 


1110010000000110 


0.64 ± 0.007 


0.57 ± 0.02 




MS 


1101111110101111 


0.30 ± 0.02 


0.34 ± 0.02 




E(r ,a) 


0101010000101010 


0.28 ± 0.02 


0.42 ± 0.01 



Table 5: Design results for six N = 16 off-lattice target structures. For each structure three 
sequences are listed together with the corresponding (<5 2 ) and (Sq): the sequence used to generate 
the target structure ("target"), and the sequences obtained by multisequence design (MS) and 
E(tq, (r)-minimization, respectively. 



Verification 



To assess the quality of the designed sequences, we measured the mean-square distances to their 
respective target structures, (Sq), using simulated tempering. In Tables || and || we give both (Sq) 
and (S 2 ) at T = 0.15 for each of the sequences. From these tables a few features emerge: 



• For target structures where the original sequence is good (small (Sq)), the multisequence 
approach either returns the original sequence or finds an even better sequence. 

• For target structures where the original sequence is bad (high (Sq)), the multisequence ap- 
proach often finds sequences with significantly lower (Sq). 



18 





method 


a 


(<P)T=0.15 


<<5o)r=0.15 


20-1 


target 


11110011110110111001 


0.08 ± 0.01 


0.04 ±0.01 




MS 


11110011110010111001 


0.02 ± 0.001 


0.01 ±0.002 




E(r ,a) 


11110011111110101001 


0.27 ± 0.04 


0.29 ± 0.01 


20-2 


target 


11110110101100111011 


0.27 ± 0.05 


0.15± 0.01 




MS 


11110100100100111111 


0.02 ± 0.004 


0.01± 0.003 




E(r ,a) 


11110010101010111111 


0.24 ± 0.05 


0.93± 0.01 


20-3 


target 


11100100101001010101 


0.30 ± 0.04 


0.38 ± 0.01 




MS 


11111100101001010111 


0.10± 0.02 


0.12 ± 0.01 




E(r ,a) 


10101000101001010111 


0.59 ± 0.02 


0.53 ± 0.01 


20-4 


target 


01101111010110111110 


0.24 ± 0.02 


0.34± 0.01 




MS 


01101010010111111110 


0.05 ± 0.01 


0.03 ± 0.01 




E(r ,a) 


01101011010111111110 


0.10 ± 0.01 


0.05± 0.01 


20-5 


target 


01111110111101101100 


0.46 ± 0.04 


0.29 ± 0.01 




MS 


11111110100101111101 


0.46 ± 0.04 


0.43± 0.01 




E(r ,a) 


01111110100101111101 


0.65 ± 0.09 


0.46± 0.01 


20-6 


target 


01100111000101011010 


0.73 ± 0.01 


0.75 ±0.01 




MS 


11100111001101011111 


0.64 ± 0.02 


0.73 ± 0.01 




E(r ,(x) 


01111010100101001010 


0.52 ± 0.08 


0.91 ± 0.01 



Table 6: Design results for six N = 20 off-lattice target structures. The corresponding sequences 
are those from Table 1 in jllj but here ordered according to decreasing (S 2 ). Same notation as in 
Table |. 



• With only one exception, structure 16-4, the results are better or much better for multisequence 
design than for the energy minimization method. For structure 16-4, the (S 2 ,) values are 
relatively high for both methods, as well as for the original sequence. 

It should be stressed that in those instances where the multisequence approach fails to find a good 
sequence, the original sequence is bad, too. Hence, it is likely that these target structures do not 
represent designable structures. 

While this very simple implementation of multisequence design has been tested with success, it 
should be kept in mind that there are a number of possible improvements. As already mentioned, 
it would, for example, in off-lattice problems be more natural to maximize the fuzzy version of the 
conditional probability in Eq. (|J), rather than the one referring to a single structure tq used here. 



6 Biological Implications 

Sequence Design, the inverse of protein folding, is of utmost relevance for e.g. drug design. The 
study of the statistical mechanics of protein folding is hampered by well-known computational 
difficulties. In sequence design, the major difficulty is to ensure that the designed sequence has 
the target structure as its global energy minimum. It is the ambitious goal of the multisequence 
design method to achieve that, by a simultaneous search of conformation and sequence spaces. As 
it stands, the method is applicable to a fairly wide range of hydrophobic/polar models. 
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7 Summary 



A novel MC scheme for sequence optimization in coarse-grained protein models has been presented 
and tested on hydrophobic/polar models. With simultaneous moves in both sequence and confor- 
mation space according to a judiciously chosen joint distribution, an efficient way of maximizing the 
corresponding conditional probabilities emerges, in which two different prescriptions are given for 
removing sequences not suitable for the target structures. One is a simple energy comparison that 
can be applied to lattice models, whereas the other one is based upon the marginal distribution 
P{cr) and can be applied to both lattice and off-lattice models. 

The potential memory problem of keeping track of removal of most of the 2 N sequences for large N 
is dealt with by an iterative method, capitalizing on the fact that the assignment of certain positions 
in the chain tend to get "frozen" to hydrophobic/polar residues. Furthermore, a modified algorithm 
was tentatively explored that addresses the problem of finding designable structures. This is highly 
relevant given that structures differ widely in designability (f7[ ^8[ . 

Our design method is evaluated on a number of 2D lattice (JV = f 6, 18, 32 and 50) and 3D off-lattice 
(N = 16 and 20) structures with the following results: 



• For N = 16 and 18 lattice chains, where the results can be gauged against exact enumeration, 
the results come out extremely well both with respect to performance and efficiency. In this 
context we also compare with and discuss other non-exact approaches — E{tq, ^-minimization 
and high-T expansion. With respect to the former, we give, in contrast to other comparisons 
in the literature, the approach a fair chance by scanning over all possible net hydrophobicities. 

• For N > 18 lattice chains, finding suitable design structures and verifying good folding prop- 
erties of the designed chains is not trivial. For iV = 32 a suitable structure was designed "by 
hand" , whereas for N = 50 a more systematic procedure was employed, where a variant of 
the multiscquence approach was used to find a designable structure. For both N — 32 and 
N = 50 structures the results from the design procedure were verified to be correct. 

• For N — 16 and N — 20 off-lattice chains, a set of structures representing both good and bad 
folders were used to test the design method. For good folding sequences, the design procedure 
either identifies the original sequence or finds a sequence with improved folding properties. 
In the case of bad folding sequences, the design procedure typically finds a sequence with 
improved folding properties. 



We also separately evaluate the efficiency of the multisequence approach as compared to standard 
MC for ordinary thermodynamic folding simulations. Such a test was carried out in Q using 
carefully tuned parameters g(a). The results presented here show that it can be less expensive to 
fold 100-1000 chains simultaneously than a single one, even with a simple choice of g(a) [Eq. (B)]. 

The size of the sequence optimization problem increases rapidly with the size of the alphabet, and 
our approach is, as it stands, not practical for models with twenty amino acids. What might be 
feasible in this case is an approach along the lines of ^9|, where it was shown that an accurate 
description of the widely used Miyazawa-Jernigan 20 x 20 interaction matrix |3(J can be obtained 
in terms of its first two principal components. 
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